This is an exploratory data visualization project on the Prosper Loan datasets. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import warnings
warnings.simplefilter("ignore")
#importing the Medical_appointment dataset
data = pd.read_csv(r'C:\Users\Master Ridwan\Desktop\my data\prosperLoanData.csv', header = 0)
data.head(5)
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
5 rows × 81 columns
#check for the structure of the dataset
data.shape
(113937, 81)
#Check for more imformation about the dataset(Non-Null Count, Dtype )
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
The dataset consists of 113937 rows of loan records and 81 features.
The main features of interest are the attributes that surrounds the borrowers in relation to the loan
#lets copy the original data
loan_df = data.copy()
#selecting features of interest
cols =['ListingKey', 'ListingCreationDate', 'MemberKey','EmploymentStatus', 'Occupation','IncomeVerifiable','StatedMonthlyIncome', 'IncomeRange','BorrowerAPR','BorrowerRate','BorrowerState','DebtToIncomeRatio','IsBorrowerHomeowner','CreditScoreRangeLower','CreditScoreRangeUpper','LoanKey','LoanOriginalAmount','LoanOriginationDate','LoanOriginationQuarter','Term','ClosedDate', 'TotalProsperLoans']
data_df = loan_df[cols]
data_df.head()
| ListingKey | ListingCreationDate | MemberKey | EmploymentStatus | Occupation | IncomeVerifiable | StatedMonthlyIncome | IncomeRange | BorrowerAPR | BorrowerRate | ... | IsBorrowerHomeowner | CreditScoreRangeLower | CreditScoreRangeUpper | LoanKey | LoanOriginalAmount | LoanOriginationDate | LoanOriginationQuarter | Term | ClosedDate | TotalProsperLoans | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 2007-08-26 19:09:29.263000000 | 1F3E3376408759268057EDA | Self-employed | Other | True | 3083.333333 | $25,000-49,999 | 0.16516 | 0.1580 | ... | True | 640.0 | 659.0 | E33A3400205839220442E84 | 9425 | 2007-09-12 00:00:00 | Q3 2007 | 36 | 2009-08-14 00:00:00 | NaN |
| 1 | 10273602499503308B223C1 | 2014-02-27 08:28:07.900000000 | 1D13370546739025387B2F4 | Employed | Professional | True | 6125.000000 | $50,000-74,999 | 0.12016 | 0.0920 | ... | False | 680.0 | 699.0 | 9E3B37071505919926B1D82 | 10000 | 2014-03-03 00:00:00 | Q1 2014 | 36 | NaN | NaN |
| 2 | 0EE9337825851032864889A | 2007-01-05 15:00:47.090000000 | 5F7033715035555618FA612 | Not available | Other | True | 2083.333333 | Not displayed | 0.28269 | 0.2750 | ... | False | 480.0 | 499.0 | 6954337960046817851BCB2 | 3001 | 2007-01-17 00:00:00 | Q1 2007 | 36 | 2009-12-17 00:00:00 | NaN |
| 3 | 0EF5356002482715299901A | 2012-10-22 11:02:35.010000000 | 9ADE356069835475068C6D2 | Employed | Skilled Labor | True | 2875.000000 | $25,000-49,999 | 0.12528 | 0.0974 | ... | True | 800.0 | 819.0 | A0393664465886295619C51 | 10000 | 2012-11-01 00:00:00 | Q4 2012 | 36 | NaN | NaN |
| 4 | 0F023589499656230C5E3E2 | 2013-09-14 18:38:39.097000000 | 36CE356043264555721F06C | Employed | Executive | True | 9583.333333 | $100,000+ | 0.24614 | 0.2085 | ... | True | 680.0 | 699.0 | A180369302188889200689E | 15000 | 2013-09-20 00:00:00 | Q3 2013 | 36 | NaN | 1.0 |
5 rows × 22 columns
#Check for the shape
data_df.shape
(113937, 22)
#Check for infomation about the new features
data_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingCreationDate 113937 non-null object 2 MemberKey 113937 non-null object 3 EmploymentStatus 111682 non-null object 4 Occupation 110349 non-null object 5 IncomeVerifiable 113937 non-null bool 6 StatedMonthlyIncome 113937 non-null float64 7 IncomeRange 113937 non-null object 8 BorrowerAPR 113912 non-null float64 9 BorrowerRate 113937 non-null float64 10 BorrowerState 108422 non-null object 11 DebtToIncomeRatio 105383 non-null float64 12 IsBorrowerHomeowner 113937 non-null bool 13 CreditScoreRangeLower 113346 non-null float64 14 CreditScoreRangeUpper 113346 non-null float64 15 LoanKey 113937 non-null object 16 LoanOriginalAmount 113937 non-null int64 17 LoanOriginationDate 113937 non-null object 18 LoanOriginationQuarter 113937 non-null object 19 Term 113937 non-null int64 20 ClosedDate 55089 non-null object 21 TotalProsperLoans 22085 non-null float64 dtypes: bool(2), float64(7), int64(2), object(11) memory usage: 17.6+ MB
#check for the descriptive statistics of numerical features
data_df.describe()
| StatedMonthlyIncome | BorrowerAPR | BorrowerRate | DebtToIncomeRatio | CreditScoreRangeLower | CreditScoreRangeUpper | LoanOriginalAmount | Term | TotalProsperLoans | |
|---|---|---|---|---|---|---|---|---|---|
| count | 1.139370e+05 | 113912.000000 | 113937.000000 | 105383.000000 | 113346.000000 | 113346.000000 | 113937.00000 | 113937.000000 | 22085.000000 |
| mean | 5.608026e+03 | 0.218828 | 0.192764 | 0.275947 | 685.567731 | 704.567731 | 8337.01385 | 40.830248 | 1.421100 |
| std | 7.478497e+03 | 0.080364 | 0.074818 | 0.551759 | 66.458275 | 66.458275 | 6245.80058 | 10.436212 | 0.764042 |
| min | 0.000000e+00 | 0.006530 | 0.000000 | 0.000000 | 0.000000 | 19.000000 | 1000.00000 | 12.000000 | 0.000000 |
| 25% | 3.200333e+03 | 0.156290 | 0.134000 | 0.140000 | 660.000000 | 679.000000 | 4000.00000 | 36.000000 | 1.000000 |
| 50% | 4.666667e+03 | 0.209760 | 0.184000 | 0.220000 | 680.000000 | 699.000000 | 6500.00000 | 36.000000 | 1.000000 |
| 75% | 6.825000e+03 | 0.283810 | 0.250000 | 0.320000 | 720.000000 | 739.000000 | 12000.00000 | 36.000000 | 2.000000 |
| max | 1.750003e+06 | 0.512290 | 0.497500 | 10.010000 | 880.000000 | 899.000000 | 35000.00000 | 60.000000 | 8.000000 |
#change the ListingCreationDate and ClosedDate to datetime format
data_df[['ListingCreationDate', 'LoanOriginationDate', 'ClosedDate']] =data_df[['ListingCreationDate', 'LoanOriginationDate', 'ClosedDate']].apply(pd.to_datetime, errors = 'coerce')
Feature Engineering
#extracting the specific Listing Creation Day and Listing Creation Month from the ListingCreationDate
data_df['ListingCreationDay'] = data_df['ListingCreationDate'].dt.day_name()
data_df['ListingCreationMonth'] = data_df['ListingCreationDate'].dt.month_name()
#extracting the specific Loan Origination Day and Loan Origination Month from the LoanOriginationDate
data_df['LoanOriginationDay'] = data_df['LoanOriginationDate'].dt.day_name()
data_df['LoanOriginationMonth'] = data_df['LoanOriginationDate'].dt.month_name()
data_df['LoanOriginationYear'] = pd.DatetimeIndex(data_df['LoanOriginationDate']).year
#extracting the specific ClosedDay and ClosedMonth from the ClosedDate
data_df['ClosedDay'] = data_df['ClosedDate'].dt.day_name()
data_df['ClosedMonth'] = data_df['ClosedDate'].dt.month_name()
data_df['ClosedYear'] = pd.DatetimeIndex(data_df['ClosedDate']).year
#Drop the columns that wont be neccessary for further analysis
data_df.drop(['ListingCreationDate','LoanOriginationDate','ClosedDate'], axis = 1, inplace = True)
#Show the first 5 cols
data_df.head()
| ListingKey | MemberKey | EmploymentStatus | Occupation | IncomeVerifiable | StatedMonthlyIncome | IncomeRange | BorrowerAPR | BorrowerRate | BorrowerState | ... | Term | TotalProsperLoans | ListingCreationDay | ListingCreationMonth | LoanOriginationDay | LoanOriginationMonth | LoanOriginationYear | ClosedDay | ClosedMonth | ClosedYear | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 1F3E3376408759268057EDA | Self-employed | Other | True | 3083.333333 | $25,000-49,999 | 0.16516 | 0.1580 | CO | ... | 36 | NaN | Sunday | August | Wednesday | September | 2007 | Friday | August | 2009.0 |
| 1 | 10273602499503308B223C1 | 1D13370546739025387B2F4 | Employed | Professional | True | 6125.000000 | $50,000-74,999 | 0.12016 | 0.0920 | CO | ... | 36 | NaN | Thursday | February | Monday | March | 2014 | NaN | NaN | NaN |
| 2 | 0EE9337825851032864889A | 5F7033715035555618FA612 | Not available | Other | True | 2083.333333 | Not displayed | 0.28269 | 0.2750 | GA | ... | 36 | NaN | Friday | January | Wednesday | January | 2007 | Thursday | December | 2009.0 |
| 3 | 0EF5356002482715299901A | 9ADE356069835475068C6D2 | Employed | Skilled Labor | True | 2875.000000 | $25,000-49,999 | 0.12528 | 0.0974 | GA | ... | 36 | NaN | Monday | October | Thursday | November | 2012 | NaN | NaN | NaN |
| 4 | 0F023589499656230C5E3E2 | 36CE356043264555721F06C | Employed | Executive | True | 9583.333333 | $100,000+ | 0.24614 | 0.2085 | MN | ... | 36 | 1.0 | Saturday | September | Friday | September | 2013 | NaN | NaN | NaN |
5 rows × 27 columns
Question: Which Employment Status does the dataset contain?
def Countplot(col):
plt.figure(figsize = (14,8))
countplot =sns.countplot(x = col, data = data_df, color = 'cornflowerblue' )
sns.set_style("darkgrid")
plt.tight_layout()
return countplot
#Show the plot Distribution of Employment Status by counts
Countplot('EmploymentStatus')
plt.title('Total Sum of Borrower Per EmploymentStatus', size = 15)
Text(0.5, 1.0, 'Total Sum of Borrower Per EmploymentStatus')
Question: what is the total sum of borrower we have in each states in the datasets?
#Show the plot Distribution of Employment Status by counts
Countplot('BorrowerState')
plt.title('Total Sum of Borrower Per EmploymentStatus', size = 15)
Text(0.5, 1.0, 'Total Sum of Borrower Per EmploymentStatus')
Question: what is the total sum of borrower we have in the top Ten (10)occupations in the datasets?
#Show the plot Distribution of top 10 Occupation by counts
plt.figure(figsize = (14,8))
data_df['Occupation'].value_counts().head(10).plot.bar()
plt.legend()
plt.xlabel('Occupation')
plt.ylabel('Counts')
sns.set_style("darkgrid")
plt.tight_layout()
plt.title('Total Sum of Borrower Per Occupation', size = 15)
Text(0.5, 1.0, 'Total Sum of Borrower Per Occupation')
Question: what is the distribution of borrower based on their Stated Monthly Income?
#Create an Histogram for Stated Monthly Income Distribution
plt.figure(figsize = [15, 6])
bin= np.arange(0, data_df['StatedMonthlyIncome'].max()+600, 600)
plt.hist(data= data_df, x = 'StatedMonthlyIncome', bins= bin)
plt.xlim(0, 20000)
plt.xlabel('Stated Monthly Income')
plt.ylabel('Counts')
plt.title('Distribution of Stated Monthly Income', size = 15)
Text(0.5, 1.0, 'Distribution of Stated Monthly Income')
Question: what is the distribution of borrower based on the Loan Original Amount?
#Create an Histogram for Loan Original Amount
plt.figure(figsize = [15, 6])
bin= np.arange(0, data_df['LoanOriginalAmount'].max()+600, 600)
plt.hist(data= data_df, x = 'LoanOriginalAmount', bins= bin)
plt.xlim(0, 20000)
plt.xlabel('Loan Original Amount')
plt.ylabel('Counts')
plt.title(' Loan Original Amount Distribution', size = 15)
Text(0.5, 1.0, ' Loan Original Amount Distribution')
Question: What is the distribution of borrower based on their Income Range?
#Show barchart for Income Range Distribution
plt.figure(figsize = [15, 8])
color_pal = sns.color_palette()[0]
order_type = ['$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999','$75,000-99,999', '$100,000+', 'not displayed', 'Not employed']
sns.countplot(data= data_df, x= 'IncomeRange', color = color_pal, order = order_type)
plt.xlabel("Income Range")
plt.ylabel("Counts")
plt.title("Distribution of Income Range")
Text(0.5, 1.0, 'Distribution of Income Range')
Question: what is the distribution of borrower based on the Borrower Rate Distribution?
#Create an Histogram for Borrower Rate Distribution
plt.figure(figsize = [15, 6])
bin= np.arange(0, data_df['BorrowerRate'].max()+0.009,0.009)
plt.hist(data= data_df, x = 'BorrowerRate', bins= bin)
plt.xlabel('Borrower Rate')
plt.xlim(0, 0.40)
plt.ylabel('Counts')
plt.title(' Borrower Rate Distribution', size = 15)
Text(0.5, 1.0, ' Borrower Rate Distribution')
Question: which day of the week did Listing Creation occur most?
#show a doughnut chat Dritribution of Listing Creation Day
plt.figure(figsize = (10, 8))
sorted_counts = data_df['ListingCreationDay'].value_counts()
labels = ['Monday', 'Tuesday', 'Wednesday', 'Thurday', 'Friday', 'Saturday', 'Sunday']
plt.pie(sorted_counts, labels = labels, startangle = 90,
counterclock = False, wedgeprops = {'width' : 0.4}, autopct="%1.1f%%")
plt.axis('square')
plt.legend()
plt.title('Listing Creation Day Ditribution', size =15)
Text(0.5, 1.0, 'Listing Creation Day Ditribution')
Question: which day of the week did Loan Origination occur most?
#show a doughnut chat Dritribution of Loan Origination Day
plt.figure(figsize = (10, 8))
sorted_counts = data_df['LoanOriginationDay'].value_counts()
labels = ['Monday', 'Tuesday', 'Wednesday', 'Thurday', 'Friday']
plt.pie(sorted_counts, labels = labels, startangle = 90,
counterclock = False, wedgeprops = {'width' : 0.4}, autopct="%1.1f%%")
plt.axis('square')
plt.legend()
plt.title('Loan Origination Day Ditribution')
Text(0.5, 1.0, 'Loan Origination Day Ditribution')
Question: which day of the week is the most Closing Day?
#show a doughnut chat Dritribution of ClosedDay
plt.figure(figsize = (10, 8))
sorted_counts = data_df['ClosedDay'].value_counts()
labels = ['Monday', 'Tuesday', 'Wednesday', 'Thurday', 'Friday', 'Saturday', 'Sunday']
plt.pie(sorted_counts, labels = labels, startangle = 90,
counterclock = False, wedgeprops = {'width' : 0.4}, autopct="%1.1f%%")
plt.axis('square')
plt.legend()
plt.title('Loan Closed Day Day Ditribution')
Text(0.5, 1.0, 'Loan Closed Day Day Ditribution')
Question: which month of the year did Loan Origination occur most?
#plot for patients that showed up and those that do not show based on variable named Scholarship
plt.figure(figsize = (14,8))
order_type = ['January','February','March','April','May','June', 'July', 'August', 'September', 'October', 'November', 'December']
sns.countplot (x = "LoanOriginationMonth" , data = data_df, color = 'cornflowerblue', order = order_type)
plt.title('Loan Origination Month Distribution', size = 15)
plt.xticks(rotation = 90)
sns.set_style("darkgrid")
plt.tight_layout()
Question: which year did Loan Origination occur most?
#Create an bar chart for Loan Origination Year Distribution
Countplot('LoanOriginationYear')
plt.title('Loan Origination Year Distribution', size = 15)
Text(0.5, 1.0, 'Loan Origination Year Distribution')
Question: What are the number Terms we have in the dataset?
#Create an bar chart for Term Distribution
plt.figure(figsize = [14, 8])
sorted_counts = data_df['Term'].value_counts()
plt.pie(sorted_counts, labels = sorted_counts.index, startangle = 90, counterclock = False);
plt.axis('square')
plt.legend()
plt.title('Number of Term Ditribution', size = 15)
Text(0.5, 1.0, 'Number of Term Ditribution')
After I explored the distributions of various interested features in the dataset. It was discovered that Most borrowers are employed,very few of them are retired, they are mostly Professional and least of them are clerical. many of the borrower have an income range from ($)25,000-74,999. Their income ratio is right skewed. Exploring the distribution of different states,it was found out that the state CA have the highest number of borrowers. Most loans were Listed and originated on Tuesday while the least was Listed on Sunday and originated on monday. This year 2013 have the highest count of loan origination
Of the features I investigated,I changed some feature's datatype such as ListCreationDate which was previously in object Dtype to the appropriate form which is datetime. I extracted some specific day, month and year from ListingCreationDay, LoanOriginationDay and ClosedDay. This will enable me to explore the dataset more.
data_df.head()
| ListingKey | MemberKey | EmploymentStatus | Occupation | IncomeVerifiable | StatedMonthlyIncome | IncomeRange | BorrowerAPR | BorrowerRate | BorrowerState | ... | Term | TotalProsperLoans | ListingCreationDay | ListingCreationMonth | LoanOriginationDay | LoanOriginationMonth | LoanOriginationYear | ClosedDay | ClosedMonth | ClosedYear | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 1F3E3376408759268057EDA | Self-employed | Other | True | 3083.333333 | $25,000-49,999 | 0.16516 | 0.1580 | CO | ... | 36 | NaN | Sunday | August | Wednesday | September | 2007 | Friday | August | 2009.0 |
| 1 | 10273602499503308B223C1 | 1D13370546739025387B2F4 | Employed | Professional | True | 6125.000000 | $50,000-74,999 | 0.12016 | 0.0920 | CO | ... | 36 | NaN | Thursday | February | Monday | March | 2014 | NaN | NaN | NaN |
| 2 | 0EE9337825851032864889A | 5F7033715035555618FA612 | Not available | Other | True | 2083.333333 | Not displayed | 0.28269 | 0.2750 | GA | ... | 36 | NaN | Friday | January | Wednesday | January | 2007 | Thursday | December | 2009.0 |
| 3 | 0EF5356002482715299901A | 9ADE356069835475068C6D2 | Employed | Skilled Labor | True | 2875.000000 | $25,000-49,999 | 0.12528 | 0.0974 | GA | ... | 36 | NaN | Monday | October | Thursday | November | 2012 | NaN | NaN | NaN |
| 4 | 0F023589499656230C5E3E2 | 36CE356043264555721F06C | Employed | Executive | True | 9583.333333 | $100,000+ | 0.24614 | 0.2085 | MN | ... | 36 | 1.0 | Saturday | September | Friday | September | 2013 | NaN | NaN | NaN |
5 rows × 27 columns
data_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 MemberKey 113937 non-null object 2 EmploymentStatus 111682 non-null object 3 Occupation 110349 non-null object 4 IncomeVerifiable 113937 non-null bool 5 StatedMonthlyIncome 113937 non-null float64 6 IncomeRange 113937 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 BorrowerState 108422 non-null object 10 DebtToIncomeRatio 105383 non-null float64 11 IsBorrowerHomeowner 113937 non-null bool 12 CreditScoreRangeLower 113346 non-null float64 13 CreditScoreRangeUpper 113346 non-null float64 14 LoanKey 113937 non-null object 15 LoanOriginalAmount 113937 non-null int64 16 LoanOriginationQuarter 113937 non-null object 17 Term 113937 non-null int64 18 TotalProsperLoans 22085 non-null float64 19 ListingCreationDay 113937 non-null object 20 ListingCreationMonth 113937 non-null object 21 LoanOriginationDay 113937 non-null object 22 LoanOriginationMonth 113937 non-null object 23 LoanOriginationYear 113937 non-null int64 24 ClosedDay 55089 non-null object 25 ClosedMonth 55089 non-null object 26 ClosedYear 55089 non-null float64 dtypes: bool(2), float64(8), int64(3), object(14) memory usage: 21.9+ MB
Questions: What is the relationship between Borrower Rate and Loan Original Amount?
sns.set(rc={'figure.figsize':(10,8)}, style="whitegrid")
sns.regplot(x='BorrowerRate', y='LoanOriginalAmount', data=data_df)
plt.xlabel('BorrowerRate')
plt.ylabel('LoanOriginalAmount')
plt.title('relationship between the Loan Original Amount and Borrower Rate')
data_corr = data_df.corr()
print("The Correlation Between BorrowerRate And LoanOriginalAmount is ",data_corr.loc['BorrowerRate','LoanOriginalAmount']);
The Correlation Between BorrowerRate And LoanOriginalAmount is -0.3289599499385382
Questions: What is the correlation between the numerical variables?
cols =['StatedMonthlyIncome','BorrowerAPR','BorrowerRate','DebtToIncomeRatio','CreditScoreRangeLower','CreditScoreRangeUpper','LoanOriginalAmount','Term']
corr_df = data_df[cols]
#shows the heapmap for the relationship betwen some numercal variables
plt.figure(figsize = (14,6))
heapmap =sns.heatmap (corr_df.corr(), annot = True, cmap = 'RdYlGn', linewidths = .9)
plt.title('Relationship between two numerical variables', size = 20)
Text(0.5, 1.0, 'Relationship between two numerical variables')
Questions: What is distribution of StatedMonthlyIncome for each EmploymentStatus?
#show the boxplot of StatedMonthlyIncome distribution for EmploymentStatus
import plotly.express as px
fig = px.box(data_df, x = 'EmploymentStatus', y = 'StatedMonthlyIncome',range_y=[0,25000])
fig.show()
Questions: What is relationship between LoanOriginalAmount and term?
#Show barchart for the relationship between LoanOriginalAmount and term
plt.figure(figsize = [15, 6])
color_pal = sns.color_palette()[0]
sns.barplot(data= data_df, x= 'Term', y = 'LoanOriginalAmount' , color = color_pal)
plt.xlabel("Term")
plt.ylabel("LoanOriginalAmount")
plt.title("The Sum Distribution of Loan Original Amount in respect to the Terms ", size =20)
Text(0.5, 1.0, 'The Sum Distribution of Loan Original Amount in respect to the Terms ')
Question: what is the relationship between IncomeRange and CreditScoreRangeUpper?
#Show barchart for the relationship between IncomeRange and CreditScoreRangeUpper
plt.figure(figsize = [15, 7])
color_pal = sns.color_palette()[0]
order_type = ['$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999','$75,000-99,999', '$100,000+', 'not displayed', 'Not employed' ]
sns.barplot(data= data_df, x= 'IncomeRange', y = 'CreditScoreRangeUpper' , color = color_pal, order = order_type)
plt.xlabel("IncomeRange")
plt.ylabel("CreditScoreRangeUpper")
plt.title("The Sum Distribution of Credit Score Range Upper in respect to the Income Range ", size =20)
Text(0.5, 1.0, 'The Sum Distribution of Credit Score Range Upper in respect to the Income Range ')
During the Investigation, the relationship between two features were studied. i discovered that, of all the numerical features in the datasets, the correlation between the Borrower APR and Borrower Rate was the highest which is 0.99. moreso, It was discovered that the borrower interest Rate and loan original amount are negatively correlated, which means the more the loan amount, the lower the Borrower Rate. however,The loan original amount is positively correlated to the stated monthly income. That is, the higher their stated monthly income, the higher the loan amount borrowed.
The relationship between the each Employment status and original loan amount shows that those who are employed and fulltime on average take out larger loans than other groups. in the case of original loan amount and terms,their relationship indicates that the higher the orignal amount of the loan, the higher length of the loan express in months.
Apart from the main features of interest, I also observed that:
There more of only 1 investors
The LoanStatus shows that we have more loan that are currently active even more than those that are completed. it was also observed that there is a very rare case of a loan being cancelled.
There are less recommedations (mostly 0)
Borrowers with income ranging from ($)50,000-100,000+ are majorly homeowners
data_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 MemberKey 113937 non-null object 2 EmploymentStatus 111682 non-null object 3 Occupation 110349 non-null object 4 IncomeVerifiable 113937 non-null bool 5 StatedMonthlyIncome 113937 non-null float64 6 IncomeRange 113937 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 BorrowerState 108422 non-null object 10 DebtToIncomeRatio 105383 non-null float64 11 IsBorrowerHomeowner 113937 non-null bool 12 CreditScoreRangeLower 113346 non-null float64 13 CreditScoreRangeUpper 113346 non-null float64 14 LoanKey 113937 non-null object 15 LoanOriginalAmount 113937 non-null int64 16 LoanOriginationQuarter 113937 non-null object 17 Term 113937 non-null int64 18 TotalProsperLoans 22085 non-null float64 19 ListingCreationDay 113937 non-null object 20 ListingCreationMonth 113937 non-null object 21 LoanOriginationDay 113937 non-null object 22 LoanOriginationMonth 113937 non-null object 23 LoanOriginationYear 113937 non-null int64 24 ClosedDay 55089 non-null object 25 ClosedMonth 55089 non-null object 26 ClosedYear 55089 non-null float64 dtypes: bool(2), float64(8), int64(3), object(14) memory usage: 21.9+ MB
Questions: What is the Distribution of Loan Original Amount for the Terms per IncomeVerifiable
#Show barchart for the realationship between LoanOriginalAmount and term
plt.figure(figsize = [15, 6])
color_pal = sns.color_palette()[0]
sns.barplot(data= data_df, x= 'Term', y = 'LoanOriginalAmount' , hue = 'IncomeVerifiable', color = color_pal)
plt.xlabel("Term")
plt.ylabel("LoanOriginalAmount")
plt.title("The Sum Distribution of Loan Original Amount vs Term in IncomeVerifiable ")
Text(0.5, 1.0, 'The Sum Distribution of Loan Original Amount vs Term in IncomeVerifiable ')
Question: what is the influence of the Term on relationship between DebtToIncomeRatio and loan amount
#Show barchart for the ralationship between LoanOriginalAmount and DebtToIncomeRatio with respect to the tems
plt.figure(figsize = [15, 6])
color_pal = sns.color_palette()[0]
sns.barplot(data= data_df, x= 'LoanOriginationYear', y = 'DebtToIncomeRatio' , hue = 'Term', color = color_pal)
plt.xlabel("Term")
plt.ylabel("LoanOriginalAmount")
plt.title("The Sum Distribution of Loan Original Amount vs Debt To Income Ratio in respect to the Terms")
Text(0.5, 1.0, 'The Sum Distribution of Loan Original Amount vs Debt To Income Ratio in respect to the Terms')
Question:what is the influence of the Term on relationship between BorrowerRate and loan amount
# Check the influence of the Term on relationship between BorrowerRate and loan amount
group=sns.FacetGrid(data=data_df, aspect=1.2, height=4, col='Term', col_wrap=3)
group.map(sns.regplot, 'LoanOriginalAmount', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1});
group.add_legend();
In this part of investigation, it shows that those with an Income verifiable tends to borrow an higher amount of loan and have a longer term. Also Debt To Income were higher in 2007 when we have a term of 35
Those with verifiable Income tends to borrow an higher amount of loan and have a longer term than those with no verifiable income.
During the investigation of the Prosper Loan, I focused more on the specific features of interest. Exploratory analysis was performed on individual variables as well as the relationship between two or more variables.
From the Univariant Analysis, It was deduced that Most borrowers are employed,very few of them are retired, they are mostly Professional and least of them are clerical. many of the borrower have an income range from ($)25,000-74,999 and most of them have stated monthly income that is less than 5000 Their income ratio is right skewed. Exploring the distribution of different states,it was found out that the state CA have the highest number of borrowers. Most loans were Listed and originated on Tuesday while the least was Listed on Sunday and originated on monday. This year 2013 have the highest count of loan origination
From the Bivariant Exploration during which the relationship between two features were studied. At first, the correlations between different numerical features was discovered with heatmap, the correlation between the Borrower APR and Borrower Rate was the highest which is 0.99. moreso, It was discovered that the borrower interest Rate and loan original amount are negatively correlated, which means the more the loan amount, the lower the Borrower Rate. however,The loan original amount is positively correlated to the stated monthly income. That is, the higher their stated monthly income, the higher the loan amount borrowed.
The relationship between the each Employment status and original loan amount shows that those who are employed and fulltime on average take out larger loans than other groups. in the case of original loan amount and terms,their relationship indicates that the higher the orignal amount of the loan, the higher length of the loan express in months.
Investigating further on Multivariate Exploration where relationship between three or more features were studied and also explore how each of those feature influenced one another. It was deduced that those with an Income verifiable tends to borrow an higher amount of loan and have a longer term. Also Debt To Income were higher in 2007 when we have a term of 35 months
Reference